Big is beautiful: Bootstrapping a PoS tagger for Swedish

Author

Eva Forsbom

Abstract

A statistical part-of-speech tagger trained on a one-million word Swedish corpus with validated tags was used to tag two considerably larger untagged corpora (≈ 78 and 20 million words, respectively) to bootstrap new, improved, tagger models. The new taggers all showed better accuracy both for seen and unseen words, and the best tagger had 97.02% overall accuracy evaluated on the original corpus (using 10-fold cross-validation).
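
The bootstrapping loop described in the abstract can be sketched roughly as follows. This is a minimal illustration of the data flow only (train on validated tags, tag raw text, retrain on the union), not the tagger used in the paper: the most-frequent-tag model, the function names `train`, `tag` and `bootstrap`, and the toy sentences are all assumptions made for this example. A model this simple will not reproduce the reported accuracy gains, which depend on richer contextual statistics.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Build a most-frequent-tag lexicon from (word, tag) sentences.

    Hypothetical stand-in for a real statistical tagger's training step.
    """
    word_tags = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag_label in sentence:
            word_tags[word.lower()][tag_label] += 1
            all_tags[tag_label] += 1
    # Unseen words fall back to the overall most frequent tag.
    default_tag = all_tags.most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in word_tags.items()}
    return lexicon, default_tag


def tag(model, words):
    """Tag a list of words with the model returned by train()."""
    lexicon, default_tag = model
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]


def bootstrap(gold, untagged):
    """Train on gold data, tag raw text, then retrain on gold + auto-tagged."""
    seed_model = train(gold)
    auto_tagged = [tag(seed_model, sentence) for sentence in untagged]
    return train(gold + auto_tagged)


# Toy data (invented for illustration): a tiny validated corpus and raw text.
gold = [
    [("en", "DT"), ("hund", "NN")],
    [("hunden", "NN"), ("springer", "VB")],
]
untagged = [["en", "katt"]]

model = bootstrap(gold, untagged)
print(tag(model, ["en", "katt"]))
```

In the bootstrapped model, words that occurred only in the raw text (here `katt`) now have lexicon entries derived from the seed tagger's output, which is the mechanism by which a larger automatically tagged corpus can extend coverage.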

Similar resources

Size is not Everything: Genre Balance in Bootstrapping a Swedish PoS Tagger

Part-of-speech tagging is a basic component of natural language processing and, as such, needs to be as accurate as possible, or any subsequent processing will suffer. For Swedish, most tagger models are trained on the Stockholm-Umeå Corpus (SUC; Ejerhed et al., 2006). As SUC is a balanced corpus, SUC models are better representatives of general language than models trained on news texts only, ...

Extending the View: Explorations in Bootstrapping a Swedish PoS Tagger

State-of-the-art statistical part-of-speech taggers mainly use information on tag bi- or trigrams, depending on the size of the training corpus. Some also use lexical emission probabilities above unigrams with beneficial results. In both cases, a wider context usually gives better accuracy for a large training corpus, which in turn gives better accuracy than a smaller one. Large corpora with vali...

A Comparative Study of the Impact of Part-of-Speech Tagging on Parsing in Automatic Processing of the Persian Language

In this paper, the role of Part-of-Speech (POS) tagging for parsing in automatic processing of the Persian language is studied. To this end, the impact of the quality of POS tagging as well as the impact of the quantity of information available in the POS tags on parsing are studied. To reach the goals, three parsing scenarios are proposed and compared. In the first scenario, the parser assigns...

Stagger: A modern POS tagger for Swedish

The field of Part of Speech (POS) tagging has made slow but steady progress during the last decade, though many of the new methods developed have not previously been applied to Swedish. I present a new system, based on the Averaged Perceptron algorithm and semi-supervised learning, that is more accurate than previous Swedish POS taggers. Furthermore, a new version of the Stockholm-Umeå Corpus i...

Some applications of a statistical tagger for Swedish

We will briefly describe a part-of-speech (POS) tagger for Swedish and discuss some applications: rule-based and probabilistic grammar checking, word prediction and keyword extraction. In POS tagging of a text, each word and punctuation mark in the text is assigned a morphosyntactic tag. We have designed and implemented a tagger based on a second order Hidden Markov Model [1]. Given a sequence o...

Publication date: 2006
